Abstract With rapid advances in neuroimaging techniques, research on brain disorder identification has become an emerging area in the data mining community. Brain disorder data poses many unique challenges for data mining research. For example, the raw data generated by neuroimaging experiments is in tensor representations, with typical characteristics of high dimensionality, structural complexity and nonlinear separability. Furthermore, brain connectivity networks can be constructed from the tensor data, embedding subtle interactions between brain regions. Other clinical measures are usually available, reflecting the disease status from different perspectives. Integrating complementary information in the tensor data and the brain network data, and incorporating other clinical parameters, is expected to be potentially transformative for investigating disease mechanisms and for informing therapeutic interventions. Many research efforts have been devoted to this area. They have achieved great success in various applications, such as tensor-based modeling, subgraph pattern mining and multi-view feature analysis.
In this paper, we review some recent data mining methods that are used for analyzing brain disorders.

1 Introduction

Many brain disorders are characterized by ongoing injury that is clinically silent for prolonged periods and irreversible by the time symptoms first present. New approaches for detecting early changes in subclinical periods will afford powerful tools for aiding clinical diagnosis, clarifying underlying mechanisms and informing neuroprotective interventions to slow or reverse neural injury for a broad spectrum of brain disorders, including bipolar disorder, HIV infection of the brain, Alzheimer’s disease, Parkinson’s disease, etc. Early diagnosis has the potential to greatly alleviate the burden of brain disorders and the ever increasing costs to families and society.

As the identification of brain disorders is extremely challenging, many different diagnosis tools and methods have been developed to obtain a large number of measurements from various examinations and laboratory tests. In particular, recent advances in neuroimaging technology have provided an efficient and noninvasive way of studying the structural and functional connectivity of the human brain, either normal or in a diseased state [48]. This can be attributed in part to advances in magnetic resonance imaging (MRI) capabilities [33]. Techniques such as diffusion MRI, also referred to as diffusion tensor imaging (DTI), produce in vivo images of the diffusion process of water molecules in biological tissues. By leveraging the fact that the water molecule diffusion patterns reveal microscopic details about tissue architecture, DTI can be used to perform tractography within the white matter and construct structural connectivity networks [2, 36, 12, 38, 40]. Functional MRI (fMRI) is a functional neuroimaging procedure that identifies localized patterns of brain activation by detecting associated changes in the cerebral blood flow. The primary form of fMRI uses the blood oxygenation level dependent (BOLD) response extracted from the gray matter. Another neuroimaging technique is positron emission tomography (PET). Using different radioactive tracers (e.g., fluorodeoxyglucose), PET produces a three-dimensional image of various physiological, biochemical and metabolic processes [68].

A variety of data representations can be derived from these neuroimaging experiments, which present many unique challenges for the data mining community. Conventional data mining algorithms are usually developed to tackle data in one specific representation, and the majority of them target vector-based data. However, the raw neuroimaging data is in the form of tensors, from which we can further construct brain networks connecting regions of interest (ROIs). Both are highly structured, considering the correlations between adjacent voxels in the tensor data and those between connected brain regions in the brain network data. Moreover, it is critical to explore interactions between measurements computed from the neuroimaging and other clinical experiments, which describe subjects in different vector spaces. In this paper, we review some recent data mining methods for (1) mining tensor imaging data; (2) mining brain networks; (3) mining multi-view feature vectors.


2 Tensor Imaging Analysis 

For brain disorder identification, the raw data generated by neuroimaging experiments are in tensor representations. For example, in contrast to two-dimensional X-ray images, an fMRI sample corresponds to a four-dimensional array that records the sequential changes of traceable signals in each voxel.¹

Tensors are higher order arrays that generalize the concepts of vectors (first-order tensors) and matrices (second-order tensors), whose elements are indexed by more than two indices. Each index expresses a mode of variation of the data and corresponds to a coordinate direction. In an fMRI sample, the first three modes usually encode the spatial information, while the fourth mode encodes the temporal information. The number of variables in each mode indicates the dimensionality of a mode. The order of a tensor is determined by the number of its modes. An mth-order tensor can be represented as $\mathcal{X} = (x_{i_1, \ldots, i_m}) \in \mathbb{R}^{I_1 \times \cdots \times I_m}$, where $I_i$ is the dimension of $\mathcal{X}$ along the i-th mode.

Definition 1 (Tensor product) The tensor product of three vectors $a \in \mathbb{R}^{I_1}$, $b \in \mathbb{R}^{I_2}$ and $c \in \mathbb{R}^{I_3}$, denoted by $a \otimes b \otimes c$, represents a third-order tensor with the elements $(a \otimes b \otimes c)_{i_1, i_2, i_3} = a_{i_1} b_{i_2} c_{i_3}$.

¹ A voxel is the smallest three-dimensional point volume referenced in a neuroimaging of the brain.

Fig. 1 Tensor factorization of a third-order tensor.

Tensor product is also referred to as outer product 
in some literature. An mth-order tensor is a rank-one 
tensor if it can be defined as the tensor product of m 
vectors. 

Definition 2 (Tensor factorization) Given a third-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ and an integer R, as illustrated in Figure 1, a tensor factorization of $\mathcal{X}$ can be expressed as

$$\mathcal{X} = \mathcal{X}_1 + \mathcal{X}_2 + \cdots + \mathcal{X}_R = \sum_{r=1}^{R} a_r \otimes b_r \otimes c_r \qquad (1)$$
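The factorization in Eq. (1) can be made concrete with a small sketch that rebuilds a third-order tensor as a sum of rank-one tensor products. This is an illustration only: the function names `outer3` and `cp_reconstruct` and the factor values are hypothetical, and plain nested lists stand in for a numerical array library.

```python
def outer3(a, b, c):
    """Tensor (outer) product of three vectors: T[i][j][k] = a[i] * b[j] * c[k]."""
    return [[[ai * bj * ck for ck in c] for bj in b] for ai in a]


def cp_reconstruct(factors):
    """Sum R rank-one tensors, one per (a_r, b_r, c_r) triple, as in Eq. (1)."""
    a0, b0, c0 = factors[0]
    total = [[[0.0 for _ in c0] for _ in b0] for _ in a0]
    for a, b, c in factors:
        term = outer3(a, b, c)
        for i in range(len(a0)):
            for j in range(len(b0)):
                for k in range(len(c0)):
                    total[i][j][k] += term[i][j][k]
    return total


# Hypothetical factors for a 2 x 3 x 2 tensor with R = 2.
factors = [
    ([1.0, 2.0], [1.0, 0.0, 1.0], [1.0, 1.0]),
    ([0.0, 1.0], [2.0, 1.0, 0.0], [1.0, -1.0]),
]
X = cp_reconstruct(factors)
```

Each triple contributes one rank-one term, so the tensor is described by R(I_1 + I_2 + I_3) factor entries rather than I_1 I_2 I_3 individual elements.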

One of the major difficulties brought by the tensor data is the curse of dimensionality. The total number of voxels contained in a multi-mode tensor, say, $\mathcal{X} = (x_{i_1, \ldots, i_m}) \in \mathbb{R}^{I_1 \times \cdots \times I_m}$, is $I_1 \times \cdots \times I_m$, which is exponential in the number of modes. If we unfold the tensor into a vector, the number of features will be extremely high [69]. This makes traditional data mining methods prone to overfitting, especially with a small sample size. Both computational scalability and theoretical guarantee of the traditional models are compromised by such high dimensionality [22].
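A quick calculation shows how fast the feature count grows once a tensor is unfolded into a vector. The scan dimensions below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical fMRI sample: 64 x 64 x 32 voxels recorded at 200 time points.
shape = (64, 64, 32, 200)

order = len(shape)               # number of modes (here 4)
num_features = math.prod(shape)  # entries after unfolding into one vector

print(order, num_features)
```

With only tens or hundreds of subjects, a feature vector of this length leaves a vector-based model heavily underdetermined, which is the overfitting risk described above.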

On the other hand, complex structural information is embedded in the tensor data. For example, in the neuroimaging data, values of adjacent voxels are usually correlated with each other [33]. Such spatial relationships among different voxels in a tensor image can be very important in neuroimaging applications. Conventional tensor-based approaches focus on reshaping the tensor data into matrices/vectors and thus the original spatial relationships are lost. The integration of structural information is expected to improve the accuracy and interpretability of tensor models.


2.1 Classification 

Suppose we have a set of tensor data $\mathcal{D} = \{(\mathcal{X}_i, y_i)\}_{i=1}^{n}$ for a classification problem, where $\mathcal{X}_i \in \mathbb{R}^{I_1 \times \cdots \times I_m}$ is the neuroimaging data represented as an mth-order tensor and $y_i \in \{-1, +1\}$ is the corresponding binary class label of $\mathcal{X}_i$. For example, if the i-th subject has Alzheimer’s disease, the subject is associated with a positive label, i.e., $y_i = +1$. Otherwise, if the subject is in the control group, the subject is associated with a negative label, i.e., $y_i = -1$.

Supervised tensor learning can be formulated as the optimization problem of support tensor machines (STMs) [55], which generalize the standard support vector machines (SVMs) from vector data to tensor data. The objective of such learning algorithms is to learn a hyperplane that separates the samples with different labels by as wide a margin as possible. However, tensor data may not be linearly separable in the input space. To achieve better performance in finding the most discriminative biomarkers or identifying infected subjects from the control group, a nonlinear transformation of the original tensor data should be considered in many neuroimaging applications. He et al. study the problem of supervised tensor learning with nonlinear kernels which can preserve the structure of tensor data [22]. The proposed kernel is an extension of kernels in the vector space to the tensor space which can take the multidimensional structure complexity into account.

2.2 Regression 

Slightly different from classifying disease status (discrete label), another family of problems uses tensor neuroimages to predict cognitive outcome (continuous label). These problems can be formulated in a regression setup by treating the clinical outcome as the real-valued label, i.e., $y_i \in \mathbb{R}$, and treating tensor neuroimages as the input. However, most classical regression methods take vectors as input features. Simply reshaping a tensor into a vector is clearly an unsatisfactory solution.

Zhou et al. exploit the tensor structure in imaging data and integrate tensor decomposition within a statistical regression paradigm to model multidimensional arrays [69]. By imposing a low rank approximation to the extremely high dimensional complex imaging data, the curse of dimensionality is greatly alleviated, thereby allowing development of a fast estimation algorithm and regularization. Numerical analysis demonstrates its potential applications in identifying regions of interest in brains that are relevant to a particular clinical response.
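To see why a low-rank structure helps, the following sketch assumes the coefficient tensor B takes the rank-R form of Eq. (1) and compares it with a full coefficient tensor. It is not the estimation procedure of [69]; the function names and toy values are hypothetical, and the sketch only shows that the linear predictor <B, X> can be evaluated from the factor vectors and that far fewer numbers need to be stored.

```python
def inner_full(B, X):
    """<B, X>: sum of elementwise products of two third-order tensors."""
    return sum(
        B[i][j][k] * X[i][j][k]
        for i in range(len(X))
        for j in range(len(X[0]))
        for k in range(len(X[0][0]))
    )


def inner_low_rank(factors, X):
    """<B, X> with B = sum_r a_r (x) b_r (x) c_r, without ever forming B."""
    total = 0.0
    for a, b, c in factors:
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                for k, ck in enumerate(c):
                    total += ai * bj * ck * X[i][j][k]
    return total


def param_counts(dims, rank):
    """Stored coefficients: full tensor versus rank-R factor vectors."""
    full = 1
    for d in dims:
        full *= d
    return full, rank * sum(dims)


# Toy 2 x 2 x 2 image and a single hypothetical rank-one coefficient term.
X = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
factors = [([1, 0], [1, 1], [0, 1])]
prediction = inner_low_rank(factors, X)
full_size, low_rank_size = param_counts((64, 64, 32), 3)
```

For a 64 x 64 x 32 image and R = 3, the factor vectors hold 480 numbers, compared with 131072 entries in an unconstrained coefficient tensor.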

2.3 Network Discovery 

Modern imaging techniques have allowed us to study the human brain as a complex system by modeling it as a network [1]. For example, fMRI scans consist of activations of thousands of voxels over time, embedding a complex interaction of signals and noise, which naturally presents the problem of eliciting the underlying network from brain activities in the spatio-temporal tensor data. A brain connectivity network, also called a connectome [52], consists of nodes (gray matter regions) and edges (white matter tracts in structural networks or correlations between two BOLD time series in functional networks).
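As a minimal sketch of the functional case, edges can be weighted by the Pearson correlation between the BOLD time series of two regions. The time series below are hypothetical, each series is assumed to be non-constant, and real pipelines add preprocessing steps that are not shown here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long, non-constant time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


def functional_network(series):
    """Weighted adjacency matrix: entry (i, j) is the correlation of ROI i and ROI j."""
    n = len(series)
    return [
        [0.0 if i == j else pearson(series[i], series[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical BOLD time series for three regions of interest.
roi_series = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with ROI 0
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with ROI 0
]
W = functional_network(roi_series)
```

The resulting matrix is symmetric with a zero diagonal, and its entries can be negative, which matters later when edges are generalized beyond binary values.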

Although the anatomical atlases in the brain have been extensively studied for decades, task/subject specific networks have still not been completely explored with consideration of functional or structural connectivity information. An anatomically parcellated region may contain subregions that are characterized by dramatically different functional or structural connectivity patterns, thereby significantly limiting the utility of the constructed networks. There are usually trade-offs between reducing noise and preserving utility in brain parcellation [33]. Thus investigating how to directly construct brain networks from tensor imaging data and understanding how they develop, deteriorate and vary across individuals will benefit disease diagnosis.

Davidson et al. pose the problem of network discovery from fMRI data, which involves simplifying spatio-temporal data into regions of the brain (nodes) and relationships between those regions (edges) [15]. Here the nodes represent collections of voxels that are known to behave cohesively over time; the edges can indicate a number of properties between nodes such as facilitation/inhibition (increases/decreases activity) or probabilistic (synchronized activity) relationships; the weight associated with each edge encodes the strength of the relationship.

A tensor can be decomposed into several factors. However, unconstrained tensor decomposition results of the fMRI data may not be good for node discovery because each factor is typically not a spatially contiguous region, nor does it necessarily match an anatomical region. That is to say, many spatially adjacent voxels in the same structure are not active in the same factor, which is anatomically impossible. Therefore, to discover nodes while preserving anatomical adjacency, known anatomical regions in the brain are used as masks, and constraints are added to enforce that the discovered factors closely match these masks.

Overall, current research on tensor imaging analysis presents two directions: (1) supervised: for a particular brain disorder, a classifier can be trained by modeling the relationship between a set of neuroimages and their associated labels (disease status or clinical response); (2) unsupervised: regardless of brain disorders, a brain network can be discovered from a given neuroimage.

3 Brain Network Analysis

We have briefly introduced that brain networks can be constructed from neuroimaging data where nodes correspond to brain regions, e.g., insula, hippocampus, thalamus, and links correspond to the functional/structural connectivity between brain regions. The linkage structure in brain networks can encode tremendous information about the mental health of human subjects. For example, in brain networks derived from functional magnetic resonance imaging (fMRI), functional connections can encode the correlations between the functional activities of brain regions, while structural links in diffusion tensor imaging (DTI) brain networks can capture the number of neural fibers connecting different brain regions. The complex structures and the lack of vector representations for the brain network data raise major challenges for data mining.

Next, we will discuss different approaches on how to conduct further analysis for constructed brain networks, which are also referred to as graphs hereafter.

Definition 3 (Binary graph) A binary graph is represented as $G = (V, E)$, where $V = \{v_1, \cdots, v_{n_v}\}$ is the set of vertices and $E \subseteq V \times V$ is the set of deterministic edges.

3.1 Kernel Learning on Graphs 

In the setting of supervised learning on graphs, the target is to train a classifier using a given set of graph data $\mathcal{D} = \{(G_i, y_i)\}_{i=1}^{n}$, so that we can predict the label y for a test graph G. With applications to brain networks, it is desirable to identify the disease status for a subject based on his/her uncovered brain network. Recent development of brain network analysis has made characterization of brain disorders at a whole-brain connectivity level possible, thus providing a new direction for brain disease classification.

Due to the complex structures and the lack of vector representations, graph data cannot be directly used as the input for most data mining algorithms. A straightforward solution that has been extensively explored is to first derive features from brain networks and then construct a kernel on the feature vectors.

Wee et al. use brain connectivity networks for disease diagnosis on mild cognitive impairment (MCI), which is an early phase of Alzheimer’s disease (AD) and usually regarded as a good target for early diagnosis and therapeutic interventions [61, 62, 63]. In the step of feature extraction, weighted local clustering coefficients of each ROI in relation to the remaining ROIs are extracted from all the constructed brain networks to quantify the prevalence of clustered connectivity around the ROIs. To select the most discriminative features for classification, a statistical t-test is performed and features with p-values smaller than a predefined threshold are selected to construct a kernel matrix. Through the employment of the multi-kernel SVM, Wee et al. integrate information from DTI and fMRI and achieve accurate early detection of brain abnormalities [63].
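The feature extraction step can be sketched with the unweighted local clustering coefficient, which yields one value per ROI. This is a simplification: the studies cited above use a weighted variant on weighted networks, and the small network below is hypothetical.

```python
def local_clustering(adj, v):
    """Fraction of pairs of v's neighbours that are themselves connected."""
    nbrs = [u for u in range(len(adj)) if adj[v][u] and u != v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if adj[nbrs[i]][nbrs[j]]
    )
    return 2.0 * links / (k * (k - 1))


def clustering_features(adj):
    """One feature per ROI: its local clustering coefficient."""
    return [local_clustering(adj, v) for v in range(len(adj))]


# Hypothetical 4-ROI binary network: a triangle (0, 1, 2) plus a pendant node 3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
features = clustering_features(adj)
```

Stacking such vectors over all subjects gives a subjects-by-ROIs feature matrix on which a per-feature t-test and a standard kernel can then be applied.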

However, such a strategy simply treats a graph as a collection of nodes/links, and then extracts local measures (e.g., clustering coefficient) for each node or performs statistical analysis on each link, thereby ignoring the connectivity structures of brain networks. Motivated by the fact that some data in real-world applications are naturally represented by means of graphs, while compressing and converting them to vectorial representations inevitably loses structural information, kernel methods for graphs have been extensively studied for a decade [5].

A graph kernel maps the graph data from the original graph space to the feature space and further measures the similarity between two graphs by comparing their topological structures [49]. For example, the product graph kernel is based on the idea of counting the number of walks in product graphs; the marginalized graph kernel works by comparing the label sequences generated by synchronized random walks of labeled graphs; cyclic pattern kernels for graphs count pairs of matching cyclic/tree patterns in two graphs [23].

To identify individuals with AD/MCI from healthy controls, instead of using only a single property of brain networks, Jie et al. integrate multiple properties of fMRI brain networks to improve the disease diagnosis performance. Two different yet complementary network properties, i.e., local connectivity and global topological properties, are quantified by computing two different types of kernels, i.e., a vector-based kernel and a graph kernel. As a local network property, weighted clustering coefficients are extracted to compute a vector-based kernel. As a topology-based graph kernel, the Weisfeiler-Lehman subtree kernel [49] is used to measure the topological similarity between paired fMRI brain networks. It is shown that this type of graph kernel can effectively capture the topological information from fMRI brain networks. The multi-kernel SVM is employed to fuse these two heterogeneous kernels for distinguishing individuals with MCI from healthy controls.
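The idea behind a Weisfeiler-Lehman subtree kernel can be sketched as follows: node labels are repeatedly refined with the sorted labels of their neighbours, and two graphs are compared by the label counts they share. This is a simplified illustration with hypothetical node labels and toy graphs; it is not the exact formulation or preprocessing used in the cited study.

```python
from collections import Counter


def wl_features(adj_list, labels, iterations):
    """Count node labels over `iterations` rounds of Weisfeiler-Lehman relabeling."""
    counts = Counter(labels.values())
    current = dict(labels)
    for _ in range(iterations):
        updated = {}
        for v, nbrs in adj_list.items():
            # New label = own label plus the sorted multiset of neighbour labels.
            updated[v] = (current[v], tuple(sorted(current[u] for u in nbrs)))
        current = updated
        counts.update(current.values())
    return counts


def wl_kernel(g1, g2, iterations=2):
    """Dot product of the two label-count vectors."""
    f1 = wl_features(*g1, iterations)
    f2 = wl_features(*g2, iterations)
    return sum(f1[label] * f2[label] for label in f1 if label in f2)


# Two hypothetical labeled graphs: a path A-B-A and a single edge A-B.
g1 = ({0: [1], 1: [0, 2], 2: [1]}, {0: "A", 1: "B", 2: "A"})
g2 = ({0: [1], 1: [0]}, {0: "A", 1: "B"})
similarity = wl_kernel(g1, g2, iterations=1)
```

Nested tuples are used as refined labels so that identical neighbourhoods receive identical labels in both graphs without a shared lookup table; production implementations usually compress labels to keep them short.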

3.2 Subgraph Pattern Mining 

In brain network analysis, the ideal patterns we want to mine from the data should take care of both local and global graph topological information. Graph kernel methods seem promising, but they are not interpretable. Subgraph patterns are more suitable for brain networks, since they can simultaneously model the network connectivity patterns around the nodes and capture the changes in local areas [33].

Fig. 2 An example of discriminative subgraph patterns in brain networks.

Fig. 3 An example of fMRI brain networks (left) and all possible instantiations of linkage structures between red nodes (right).

Definition 4 (Subgraph) Let $G' = (V', E')$ and $G = (V, E)$ be two binary graphs. $G'$ is a subgraph of $G$ (denoted as $G' \subseteq G$) iff $V' \subseteq V$ and $E' \subseteq E$. If $G'$ is a subgraph of $G$, then $G$ is a supergraph of $G'$.

A subgraph pattern, in a brain network, represents a collection of brain regions and their connections. For example, as shown in Figure 2, three brain regions should work collaboratively for normal people, and the absence of any connection between them can result in Alzheimer’s disease to different degrees. Therefore, it is valuable to understand which connections collectively play a significant role in disease mechanism by finding discriminative subgraph patterns in brain networks.

Mining subgraph patterns from graph data has been extensively studied by many researchers [29, 13, 56, 66]. In general, a variety of filtering criteria are proposed. A typical evaluation criterion is frequency, which aims at searching for frequently appearing subgraph features in a graph dataset satisfying a prespecified threshold. Most of the frequent subgraph mining approaches are unsupervised. For example, Yan and Han develop a depth-first search algorithm: gSpan. This algorithm builds a lexicographic order among graphs, and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts the depth-first search strategy to mine frequent connected subgraphs efficiently. Many other approaches for frequent subgraph mining have also been proposed, e.g., AGM [26], FSG [34], MoFa [4], FFSM [24], and Gaston [41].
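The frequency criterion can be illustrated under a simplifying assumption that all brain networks share the same named regions, so that checking whether a pattern occurs reduces to an edge-subset test in the sense of Definition 4. General-purpose miners such as gSpan instead handle subgraph isomorphism, which this sketch does not attempt; the region names and the threshold are hypothetical.

```python
def is_subgraph(pattern_edges, graph_edges):
    """Definition 4 for graphs over a shared node set: E' must be a subset of E."""
    return set(pattern_edges) <= set(graph_edges)


def support(pattern_edges, graphs):
    """Fraction of graphs in the dataset that contain the pattern."""
    hits = sum(1 for g in graphs if is_subgraph(pattern_edges, g))
    return hits / len(graphs)


def edge(u, v):
    """Undirected edge stored in a canonical (sorted) order."""
    return (u, v) if u <= v else (v, u)


# Hypothetical dataset of three brain networks over the same regions.
graphs = [
    {edge("insula", "thalamus"), edge("thalamus", "hippocampus")},
    {edge("insula", "thalamus"), edge("insula", "hippocampus")},
    {edge("thalamus", "hippocampus")},
]
pattern = {edge("insula", "thalamus")}
min_support = 0.5
is_frequent = support(pattern, graphs) >= min_support
```

Because any supergraph of an infrequent pattern is also infrequent, miners can prune the search space by growing patterns only from those that already meet the threshold.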


Moreover, the problem of supervised subgraph mining has been studied in recent work which examines how to improve the efficiency of searching the discriminative subgraph patterns for graph classification. Yan et al. introduce two concepts, structural leap search and frequency-descending mining, and propose LEAP [66], which is one of the first works in discriminative subgraph mining. Thoma et al. propose CORK which can yield a near-optimal solution using greedy feature selection [56]. Ranu and Singh propose a scalable approach, called GraphSig, that is capable of mining discriminative subgraphs with a low frequency threshold [46]. Jin et al. propose COM which takes into account the co-occurrences of subgraph patterns, thereby facilitating the mining process [28]. Jin et al. further propose an evolutionary computation method, called GAIA, to mine discriminative subgraph patterns using a randomized searching strategy. Zhu et al. design a diversified discrimination score based on the log ratio which can reduce the overlap between selected features by considering the embedding overlaps in the graphs.

Conventional graph mining approaches are best suited for binary edges, where the structure of graph objects is deterministic, and the binary edges represent the presence of linkages between the nodes [33]. In fMRI brain network data, however, there are inherently weighted edges in the graph linkage structure, as shown in Figure 3 (left). A straightforward solution is to threshold weighted networks to yield binary networks. However, such simplification will result in great loss of information. Ideal data mining methods for brain network analysis should be able to overcome these methodological problems by generalizing the network edges to positive and negative weighted cases, e.g., probabilistic weights in fMRI brain networks, integral weights in DTI brain networks.

Definition 5 (Weighted graph) A weighted graph is represented as $G = (V, E, p)$, where $V = \{v_1, \cdots, v_{n_v}\}$ is the set of vertices, $E \subseteq V \times V$ is the set of nondeterministic edges, and $p : E \to (0, 1]$ is a function that assigns a probability of existence to each edge in $E$.

fMRI brain networks can be modeled as weighted graphs where each edge $e \in E$ is associated with a probability $p(e)$ indicating the likelihood of whether this edge should exist or not. It is assumed that $p(e)$ of different edges in a weighted graph are independent from each other. Therefore, by enumerating the possible existence of all edges in a weighted graph, we can obtain a set of binary graphs. For example, in Fig. 3 (right), consider the three red nodes and links between them as a weighted graph. There are $2^3 = 8$ binary graphs that can be implied with different probabilities. For a weighted graph $G$, the probability of $G$ containing a subgraph feature $G'$ is defined as the probability that a binary graph implied by $G$ contains subgraph $G'$. Kong et al. propose a discriminative subgraph feature selection method based on dynamic programming to compute the probability distribution of the discrimination scores for each subgraph pattern within a set of weighted graphs [32].
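The containment probability can be checked on a three-node example by enumerating all implied binary graphs and comparing the result with the product of edge probabilities that follows from the independence assumption. The edge probabilities are hypothetical and nodes are assumed to have fixed identities; the dynamic programming procedure of [32] for score distributions is not reproduced here.

```python
from itertools import product


def containment_probability(edge_probs, pattern):
    """Pr[G contains pattern] under independent edges: product of edge probabilities."""
    prob = 1.0
    for e in pattern:
        prob *= edge_probs.get(e, 0.0)
    return prob


def enumerate_worlds(edge_probs):
    """Yield every implied binary graph together with its probability."""
    edges = list(edge_probs)
    for mask in product([False, True], repeat=len(edges)):
        present = {e for e, keep in zip(edges, mask) if keep}
        p = 1.0
        for e, keep in zip(edges, mask):
            p *= edge_probs[e] if keep else 1.0 - edge_probs[e]
        yield present, p


# Hypothetical weighted graph on three nodes, giving 2^3 = 8 implied binary graphs.
edge_probs = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
pattern = {("a", "b"), ("b", "c")}

worlds = list(enumerate_worlds(edge_probs))
by_enumeration = sum(p for present, p in worlds if pattern <= present)
closed_form = containment_probability(edge_probs, pattern)
```

Enumeration grows as 2 to the number of edges, so it only serves as a check on tiny graphs; the closed form is what makes the definition usable in practice.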

For brain network analysis, usually we only have a small number of graph instances [32]. In these applications, the graph view alone is not sufficient for mining important subgraphs. Fortunately, side information is available along with the graph data for brain disorder identification. For example, in neurological studies, hundreds of clinical, immunologic, serologic and cognitive measures may be available for each subject, apart from brain networks. These measures compose multiple side views which contain a tremendous amount of supplemental information for diagnostic purposes. It is desirable to extract valuable information from a plurality of side views to guide the process of subgraph mining in brain networks.

Figure 4(a) illustrates the process of selecting subgraph patterns in conventional graph classification approaches. The valuable information embedded in side views is clearly not fully leveraged in the feature selection process. To tackle this problem, Cao et al. introduce an effective algorithm for discriminative subgraph selection using multiple side views [9], as illustrated in Figure 4(b). Side information consistency is first validated via statistical hypothesis testing, which suggests that the similarity of side view features between instances with the same label is more likely to be larger than that between instances with different labels. Based on such observations, it is assumed that the similarity/distance between instances in the space of subgraph features should be consistent with that in the space of a side view. That is to say, if two instances are similar in the space of a side view, they should also be close to each other in the space of subgraph features. Therefore the target is to minimize the distance between subgraph features of each pair of similar instances in each side view [9]. In contrast to existing subgraph mining approaches that focus on the graph view alone, the proposed method can explore multiple vector-based side views to find an optimal set of subgraph features for graph classification.

Fig. 4 Two strategies of leveraging side views in feature selection process for graph classification. (a) Treating side views and subgraph patterns separately. (b) Using side views as guidance for the process of selecting subgraph patterns.

For graph classification, brain network analysis approaches can generally be put into three groups: (1) extracting some local measures (e.g., clustering coefficient) to train a standard vector-based classifier; (2) directly adopting graph kernels for classification; (3) finding discriminative subgraph patterns. Different types of methods model the connectivity embedded in brain networks in different ways.


4 Multi-view Feature Analysis 

Medical science witnesses every day measurements from a series of medical examinations documented for each subject, including clinical, imaging, immunologic, serologic and cognitive measures [7], as shown in Figure 5. Each group of measures characterizes the health state of a subject from a different aspect. This type of data is named multi-view data, and each group of measures forms a distinct view quantifying subjects in one specific feature space. Therefore, it is critical to combine them to improve the learning performance, while simply concatenating features from all views and transforming multi-view data into single-view data, as in method (a) shown in Figure 6, would fail to leverage the underlying correlations between different views.

Fig. 5 An example of multi-view learning in medical studies (recovered panel labels: serologic measures, clinical measures).

4.1 Multi-view Learning 


Suppose we have a multi-view classification task with n labeled instances represented from m different views: $\mathcal{D} = \{(x_i^{(1)}, x_i^{(2)}, \cdots, x_i^{(m)}, y_i)\}_{i=1}^{n}$, where $x_i^{(v)} \in \mathbb{R}^{I_v}$, $I_v$ is the dimensionality of the v-th view, and $y_i \in \{-1, +1\}$ is the class label of the i-th instance.

Representative methods for multi-view learning can be categorized into three groups: co-training, multiple kernel learning, and subspace learning [65]. Generally, the co-training style algorithm is a classic approach for semi-supervised learning, which trains in alternation to maximize the mutual agreement on different views. Multiple kernel learning algorithms combine kernels that naturally correspond to different views, either linearly [35] or nonlinearly [58, 14], to improve learning performance. Subspace learning algorithms learn a latent subspace, from which multiple views are generated. Multiple kernel learning and subspace learning are generalized as co-regularization style algorithms [53], where the disagreement between the functions of different views is taken as a part of the objective function to be minimized. Overall, by exploring the consistency and complementary properties of different views, multi-view learning is more effective than single-view learning.

In the multi-view setting for brain disorders, or for medical studies in general, a critical problem is that there may be limited subjects available (i.e., a small n) yet a large number of measurements (i.e., a large $\sum_{v=1}^{m} I_v$). Within the multi-view data, not all features in different views are relevant to the learning task, and some irrelevant features may introduce unexpected noise. The irrelevant information can even be exaggerated after view combinations, thereby degrading performance. Therefore, it is necessary to take care of feature selection in the learning process. Feature selection results can also be used by researchers to find biomarkers for brain diseases. Such biomarkers are clinically imperative for detecting injury to the brain in the earliest stage before it is irreversible. Valid biomarkers can be used to aid diagnosis, monitor disease progression and evaluate effects of intervention [32].

Conventional feature selection approaches can be divided into three main directions: filter, wrapper, and embedded methods. Filter methods compute a discrimination score of each feature independently of the other features based on the correlation between the feature and the label, e.g., information gain, Gini index, Relief [44, 47]. Wrapper methods measure the usefulness of feature subsets according to their predictive power, optimizing the subsequent induction procedure that uses the respective subset for classification [50, 57, 16]. Embedded methods perform feature selection in the process of model training based on sparsity regularization. For example, Miranda et al. add a regularization term that penalizes the size of the selected feature subset to the standard cost function of SVM, thereby optimizing the new objective function to conduct feature selection [39]. Essentially, the process of feature selection and the learning algorithm interact in embedded methods, which means the learning part and the feature selection part cannot be separated, while wrapper methods utilize the learning algorithm as a black box.
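A filter method can be sketched with information gain, which scores each feature on its own against the class label. The feature names and values below are hypothetical, and the features are assumed to be already discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def rank_features(columns, labels):
    """Filter-style ranking: score each feature on its own, highest first."""
    scores = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical discretized measures for four subjects with labels +1 / -1.
labels = [+1, +1, -1, -1]
columns = {
    "measure_a": [1, 1, 0, 0],   # separates the classes perfectly
    "measure_b": [1, 0, 1, 0],   # carries no label information
}
ranking = rank_features(columns, labels)
```

Because every feature is scored in isolation, the ranking is cheap to compute but cannot detect redundancy or interactions among features, which is what wrapper and embedded methods address.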

However, directly applying these feature selection approaches to each separate view would fail to leverage multi-view correlations. By taking into account the latent interactions among views and the redundancy triggered by multiple views, it is desirable to combine multi-view data in a principled manner and perform feature selection to obtain consensus and discriminative low dimensional feature representations.


4.2 Modeling View Correlations 

Recent years have witnessed many research efforts devoted
to the integration of feature selection and multi-view
learning. Tang et al. study multi-view feature selection
in the unsupervised setting by constraining that
similar data instances from each view should have similar
pseudo-class labels [54]. Considering brain disorder
identification, different neuroimaging features may capture
different but complementary characteristics of the
data. For example, the voxel-based tensor features convey
the global information, while the ROI-based Automated
Anatomical Labeling (AAL) [57] features summarize
the local information from multiple representative
brain regions. Incorporating these data and additional
non-imaging data sources can potentially improve
the prediction. For Alzheimer’s disease (AD) classification,
Ye et al. propose a kernel-based method for
integrating heterogeneous data, including tensor and
AAL features from MRI images, demographic information
and genetic information [68]. The kernel framework
is further extended for selecting features (biomarkers)
from heterogeneous data sources that play more significant
roles than others in AD diagnosis.

Huang et al. propose a sparse composite linear discriminant
analysis model for identification of disease-related
brain regions of AD from multiple data sources
[25]. Two sets of parameters are learned: one represents
the common information shared by all the data sources
about a feature, and the other represents the specific
information only captured by a particular data source
about the feature. Experiments are conducted on
MRI and PET data, which measure structural and functional
aspects, respectively, of the same AD pathology.
However, the proposed approach requires the same set
of variables as input from every data source.
Xiang et al. investigate multi-source incomplete data
for AD and introduce a unified feature learning model
to handle block-wise missing data, which achieves simultaneous
feature-level and source-level selection [64].

For modeling view correlations, in general, a coefficient
is assigned to each view, either at the view level
or at the feature level. For example, in multiple kernel
learning, a kernel is constructed from each view and a set
of kernel coefficients is learned to obtain an optimal
combined kernel matrix. These approaches, however, fail to
explicitly consider correlations between features.
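
The view-level weighting described above can be sketched as follows. This is an illustration under stated assumptions, not a multiple kernel learning solver: the coefficients (betas) are fixed by hand rather than learned, a linear kernel is used for every view, and the constraint that the coefficients are nonnegative and sum to one is a common convention assumed here. All names and the toy data are hypothetical.

```python
def linear_kernel(X):
    """Gram matrix of a linear kernel for one view (rows are instances)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def combine_kernels(kernels, betas):
    """Combined kernel K = sum_v beta_v * K_v, one coefficient per view."""
    # Assumed convention: nonnegative coefficients that sum to one.
    assert all(b >= 0 for b in betas) and abs(sum(betas) - 1.0) < 1e-9
    n = len(kernels[0])
    return [[sum(b * K[i][j] for b, K in zip(betas, kernels)) for j in range(n)]
            for i in range(n)]


# Two hypothetical views of the same three instances.
view1 = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
view2 = [[2.0], [0.0], [1.0]]

betas = [0.75, 0.25]  # fixed by hand; a real MKL method would learn these
K = combine_kernels([linear_kernel(view1), linear_kernel(view2)], betas)
print(K)
```

Each coefficient rescales an entire view at once, so the combined kernel never contains a term that couples a feature of one view with a feature of another. This is the sense in which such approaches do not model correlations between features explicitly.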


4.3 Modeling Feature Correlations 

One of the key issues for multi-view classification is to 
choose an appropriate tool to model features and their 
correlations hidden in multiple views, since this directly 
determines how information will be used. In contrast 
to modeling on views, another direction for modeling 
multi-view data is to directly consider the correlations 
between features from multiple views. Since taking the 
tensor product of their respective feature spaces cor¬ 
responds to the interaction of features from multiple 
views, the concept of tensor serves as a backbone for 
incorporating multi-view features into a consensus representation
by means of tensor product, where the complex
multiple relationships among views are embedded
within the tensor structures. By mining structural information
contained in the tensor, knowledge of multi-view
features can be extracted and used to establish a
predictive model.

Fig. 6 Schematic view of the key differences among three
strategies of multi-view feature selection [6]. (Figure not
reproduced; surviving labels: Modeling, Feature selection,
View 1, View 2, View 3, Method (a), Method (b).)

Smalter et al. formulate the problem of feature selection
in the tensor product space as an integer quadratic
programming problem. However, this method is
computationally intractable on many views, since it directly
selects features in the tensor product space,
resulting in the curse of dimensionality, as method (b)
shown in Figure 6. Cao et al. propose to use a tensor-based
approach to model features and their correlations
hidden in the original multi-view data [6]. The operation
of tensor product can be used to bring m-view
feature vectors of each instance together, leading to
a tensorial representation for common structure across
multiple views, and allowing us to adequately diffuse
relationships and encode information among multi-view
features. In this manner, the multi-view classification
task is essentially transformed from an independent domain
of each view to a consensus domain as a tensor
classification problem.
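
As a small illustration of the tensor product of view-specific feature vectors, the sketch below builds the outer product of three toy views and stores it as a dictionary from index tuples to entries. The helper names (tensor_product, inner, dot) and the data are assumptions for illustration, not the implementation used in [6].

```python
from itertools import product
from math import prod


def tensor_product(views):
    """Outer product of m feature vectors.

    The result is an m-th order tensor, stored as a dict that maps an index
    tuple (i_1, ..., i_m) to the product of the selected entries.
    """
    index_ranges = [range(len(v)) for v in views]
    return {idx: prod(v[i] for v, i in zip(views, idx))
            for idx in product(*index_ranges)}


def inner(A, B):
    """Tensor inner product: sum of entrywise products."""
    return sum(A[idx] * B[idx] for idx in A)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Two hypothetical instances, each described by three views.
x_views = ([1, 2], [3, 0, 1], [2, 1])
z_views = ([0, 1], [1, 1, 2], [1, 1])

X = tensor_product(x_views)
Z = tensor_product(z_views)

print(len(X))        # number of entries: 2 * 3 * 2
print(inner(X, Z))   # equals the product of the per-view dot products
```

The number of entries is the product of the view dimensions, which is why selecting features directly in the tensor product space becomes intractable as views are added. For tensors built this way, the inner product factorizes into the product of per-view inner products, so it can be computed without materializing the full tensor.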

By using $\mathcal{X}_i$ to denote $\mathbf{x}_i^{(1)} \otimes \cdots \otimes \mathbf{x}_i^{(m)}$, the dataset of
labeled multi-view instances can be represented as
$\mathcal{D} = \{(\mathcal{X}_i, y_i)\}_{i=1}^{n}$. Note that each multi-view instance $\mathcal{X}_i$
is an $m$th-order tensor that lies in the tensor product
space $\mathbb{R}^{I_1 \times \cdots \times I_m}$. Based on the definitions of inner product
and tensor norm, multi-view classification can be
formulated as a global convex optimization problem in
the framework of supervised tensor learning [55]. This
model is named multi-view SVM [6], and it can be
solved with the use of optimization techniques developed
for SVM.
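
For orientation, one standard way to write such a problem is the soft-margin SVM with the vector inner product replaced by the tensor inner product. The form below is a sketch under that assumption; the exact objective and constraints used in [6] and [55] may differ. Here $\mathcal{X}_i$ is the $m$th-order tensor of the $i$-th instance, $y_i \in \{-1, +1\}$ is its label, $\mathcal{W}$ is a weight tensor of the same shape, $b$ is a bias, $\xi_i$ are slack variables, and $C > 0$ is a trade-off parameter. The symbols $\mathcal{W}$, $b$, $\xi_i$ and $C$ are notation introduced here.

```latex
\[
\begin{aligned}
\min_{\mathcal{W},\, b,\, \xi} \quad
  & \frac{1}{2}\,\lVert \mathcal{W} \rVert_F^2 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad
  & y_i \bigl( \langle \mathcal{W}, \mathcal{X}_i \rangle + b \bigr) \ge 1 - \xi_i,
    \qquad \xi_i \ge 0, \qquad i = 1, \dots, n, \\
\text{where} \quad
  & \langle \mathcal{W}, \mathcal{X} \rangle
    = \sum_{i_1, \dots, i_m} w_{i_1 \cdots i_m}\, x_{i_1 \cdots i_m},
    \qquad
    \lVert \mathcal{W} \rVert_F = \sqrt{\langle \mathcal{W}, \mathcal{W} \rangle}.
\end{aligned}
\]
```

The objective is a convex quadratic and the constraints are linear in $(\mathcal{W}, b, \xi)$, so the problem is convex. After vectorizing the tensors it has the same form as a standard linear SVM, which is consistent with the remark that SVM optimization techniques apply.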

Furthermore, a dual method for multi-view feature
selection is proposed in [6] that leverages the relationship
between original multi-view features and reconstructed
tensor product features to facilitate the implementation
of feature selection, as method (c) in
Figure 6. It is a wrapper model which selects useful
features in conjunction with the classifier and simultaneously
exploits the correlations among multiple views.
Following the idea of SVM-based recursive feature elimination,
multi-view feature selection is consistently
formulated and implemented in the framework of multi-view
SVM. This idea can extend to include lower order
feature interactions and to employ a variety of loss functions
for classification or regression.
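
The recursive elimination idea can be sketched on ordinary single-view vectors as below. This is a simplified stand-in, not the multi-view procedure of [6]: it trains a plain linear SVM by batch subgradient descent on the regularized hinge loss, then repeatedly removes the feature with the smallest squared weight, which is the ranking criterion of SVM-based recursive feature elimination. The function names, hyperparameters and toy data are assumptions for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularized hinge loss (no bias)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # only margin violators contribute to the subgradient
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def rfe_ranking(X, y):
    """Return feature indices in elimination order, least useful first."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > 1:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)
        # Drop the feature whose squared weight is smallest.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated + remaining


# Hypothetical toy data: feature 0 is strongly informative, feature 1 weakly
# informative, feature 2 is balanced within each class (uninformative).
X = [[1.0, 0.25, 1.0],
     [1.0, 0.25, -1.0],
     [-1.0, -0.25, 1.0],
     [-1.0, -0.25, -1.0]]
y = [1, 1, -1, -1]

ranking = rfe_ranking(X, y)
print(ranking)
```

The classifier is retrained after every removal, so the weights used for ranking always reflect the current feature subset. That retraining is what makes this a wrapper procedure rather than a filter.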


5 Future Work 

The human brain is one of the most complicated bi¬ 
ological structures in the known universe. While it is 
very challenging to understand how it works, especially 
when disorders and diseases occur, dozens of leading 
technology firms, academic institutions, scientists, and 
other key contributors to the field of neuroscience have 
devoted themselves to this area and made significant 
improvements in various dimensions.² Data mining on
brain disorder identification has become an emerging 
area and a promising research direction. 

This paper provides an overview of data mining ap¬ 
proaches with applications to brain disorder identifica¬ 
tion which have attracted increasing attention in both 
data mining and neuroscience communities in recent 
years. A taxonomy is built based upon data representations,
i.e., tensor imaging data, brain network data
and multi-view data, following which the relationships 
between different data mining algorithms and different 
neuroimaging applications are summarized. We briefly 
present some potential topics of interest in the future. 

Bridging heterogeneous data representations. 
As introduced in this paper, we can usually derive data 
from neuroimaging experiments in three representations, 
including raw tensor imaging data, brain network data 
and multi-view vector-based data. It is critical to study 
how to train a model on a mixture of data representa¬ 
tions, although it is very challenging to combine data 
that are represented in tensor space, vector space and 
graph space, respectively. There is a straightforward 
idea of defining different kernels on different feature 
spaces and combining them through multi-kernel algorithms.
However, it is usually hard to interpret the results.
The concept of side view has been introduced to
facilitate the process of mining brain networks, which 
may also be used to guide supervised tensor learning. 
It is even more interesting if we can learn on tensors 
and graphs simultaneously. 

2 http://www.whitehouse.gov/BRAIN 



Fig. 7 A bioinformatics heterogeneous information network 
schema. 


Integrating multiple neuroimaging modalities. 

There are a variety of neuroimaging techniques avail¬ 
able characterizing subjects from different perspectives 
and providing complementary information. For exam¬ 
ple, DTI contains local microstructural characteristics 
of water diffusion; structural MRI can be used to delineate
brain atrophy; fMRI records BOLD response related
to neural activity; PET measures metabolic patterns
[63]. Based on such multimodality representation,
it is desirable to find useful patterns with rich seman¬ 
tics. For example, it is important to know which connec¬ 
tivity between brain regions is significant in the sense 
of both structure and functionality. On the other hand, 
by leveraging the complementary information embed¬ 
ded in the multimodality representation, better perfor¬ 
mance on disease diagnosis can be expected. 

Mining bioinformatics information networks. 

Bioinformatics networks are a rich source of heterogeneous
information involving disease mechanisms, as shown in
Figure 7. The problems of gene-disease association and
drug-target binding prediction have been studied in the 
setting of heterogeneous information networks [8].
For example, in gene-disease association prediction, dif¬ 
ferent gene sequences can lead to certain diseases. Re¬ 
searchers would like to predict the association relation¬ 
ships between genes and diseases. Understanding the 
correlations between brain disorders and other diseases 
and the causality between certain genes and brain dis¬ 
eases can be transformative for yielding new insights 
concerning risk and protective relationships, for clar¬ 
ifying disease mechanisms, for aiding diagnostics and 
clinical monitoring, for biomarker discovery, for iden¬ 
tification of new treatment targets and for evaluating 
effects of intervention. 